Abstract In this work we present a review of the state of the art of
information theoretic feature selection methods. The concepts of feature
relevance, redundancy and complementarity (synergy) are clearly defined, as
well as the Markov blanket. The problem of optimal feature selection is
defined. A unifying theoretical framework is described, which can retrofit
successful heuristic criteria, indicating the approximations made by each
method. A number of open problems in the field are presented.

complementarity · synergy · Markov blanket


1 Introduction 

Feature selection has been widely investigated and used by the machine
learning and data mining community. In this context, a feature, also called
attribute or variable, represents a property of a process or system that has
been measured, or constructed from the original input variables. The goal of
feature selection is to select the smallest feature subset given a certain
generalization error, or alternatively to find the best feature subset with k
features, that yields the minimum generalization error. Additional objectives
of feature selection are: (i) improve the generalization performance with
respect to the model built using the whole set of features, (ii) provide a
more robust generalization and a faster response with unseen data, and (iii)
achieve a better and simpler understanding of the process that generates the
data [31].


We will assume that the feature selection method is used either as a
preprocessing step or in conjunction with a learning machine for
classification or regression purposes. Feature selection methods are usually
classified in three main groups: wrapper, embedded and filter methods [?].
Wrappers [31] use the induction learning algorithm as part of the function
evaluating feature subsets. The performance is usually measured in terms of
the classification rate obtained on a testing set, i.e., the classifier is
used as a black box for assessing feature subsets. Although these techniques
may achieve a good generalization, the computational cost of training the
classifier a combinatorial number of times becomes prohibitive for
high-dimensional datasets. In addition, many classifiers are prone to
over-learning and are sensitive to initialization. Embedded methods [?]
incorporate knowledge about the specific structure of the class of functions
used by a certain learning machine, e.g. bounds on the leave-one-out error of
SVMs [?]. Although usually less computationally expensive than wrappers,
embedded methods are still much slower than filter approaches, and the
features selected are dependent on the learning machine. Filter methods [17]
assume complete independence between the learning machine and the data, and
therefore use a metric independent of the induction learning algorithm to
assess feature subsets. Filter methods are relatively robust against
overfitting, but may fail to select the best feature subset for
classification or regression. In the literature, several criteria have been
proposed to evaluate single features or feature subsets, among them:
inconsistency rate [?], inference correlation [?], classification error [?],
fractal dimension [?], distance measure [?], etc. Mutual information (MI) is a
measure of statistical dependence that has two main properties. First, it
can measure any kind of relation between random variables, including
nonlinear relationships [?]. Second, MI is invariant under transformations in
the feature space that are invertible and differentiable, e.g. translations,
rotations and any transformation preserving the order of the original
elements of the feature vectors [35, ?]. Many advances in the field have been
reported in the last 20 years since the pioneering work of Battiti [3].
Battiti defined the problem of feature selection as the process of selecting
the k most relevant variables from an original feature set of m variables,
k < m. Battiti proposed the greedy selection of a single feature at a time,
as an alternative to evaluating the combinatorial explosion of all feature
subsets belonging to the original set. The main assumptions of Battiti's work
were the following: (a) features are classified as relevant and redundant;
(b) a heuristic functional is used to select features, which allows
controlling the tradeoff between relevancy and redundancy; (c) a greedy
search strategy is used; and (d) the selected feature subset is assumed
optimal. These four assumptions will be revisited in this work to include
recent work on (a) new definitions of relevant features and other types of
features, (b) new information-theoretic functionals derived from first
principles, (c) new search strategies, and (d) new definitions of the optimal
feature subset.
In this work, we present a review of filter feature selection methods based
on mutual information, under a unified theoretical framework. We show the
evolution of feature selection methods over the last 20 years, describing
advantages and drawbacks. The remainder of this work is organized as follows.
In section 2, background on MI is presented. In section 3, the concepts of
relevant, redundant and complementary features are defined. In section 4, the
problem of optimal feature selection is defined. In section 5, a unified
theoretical framework is presented, which allows us to show the evolution of
different MI feature selection methods, as well as their advantages and
drawbacks. In section 6, a number of open problems in the field are
presented. Finally, in section 7, we present the conclusions of this work.


2 Background on MI 

2.1 Notation 

In this work we will use only discrete random variables, because in practice
the variables used in most feature selection problems are either discrete by
nature or by quantization. Let F be a feature set and C an output vector
representing the classes of a real process. Let us assume that F is the
realization of a random sampling of an unknown distribution, where f_i is the
i-th variable of F and f_i(j) is the j-th sample of vector f_i. Likewise, c_i
is the i-th component of C and c_i(j) is the j-th sample of vector c_i.
Uppercase letters denote random sets of variables, and lowercase letters
denote individual variables from these sets. Other notations and
terminologies used in this work are the following:

S            Subset of current selected variables.
f_i          Candidate feature to be added to or deleted from the subset of
             selected features S.
{f_i, f_j}   Subset composed of the variables f_i and f_j.
¬f_i         All variables in F except f_i. ¬f_i = F \ f_i.
{f_i, S}     Subset composed of variable f_i and subset S.
¬{f_i, S}    All variables in F except the subset {f_i, S}.
             ¬{f_i, S} = F \ {f_i, S}.
p(f_i, C)    Joint mass probability between variables f_i and C.
|·|          Absolute value / cardinality of a set.

The sets mentioned above are related as follows: F = f_i ∪ S ∪ ¬{f_i, S},
∅ = f_i ∩ S ∩ ¬{f_i, S}. The number of samples in F is n and the total number
of variables in F is m.


2.2 Basic Definitions 

Entropy, divergence and mutual information are basic concepts defined within
information theory [?]. In its origin, information theory was used within the
context of communication theory, to find answers about data compression and
transmission rate [?]. Since then, information theory principles have been
largely incorporated into machine learning, see for example Principe [?].


2.2.1 Entropy 


Entropy (H) is a measure of uncertainty of a random variable. The uncertainty
is related to the probability of occurrence of an event. Intuitively, high
entropy means that each event has about the same probability of occurrence,
while low entropy means that each event has a different probability of
occurrence. Formally, the entropy of a discrete random variable x, with mass
probability p(x(i)) = Pr{x = x(i)}, x(i) ∈ x, is defined as:

    H(x) = -\sum_{i=1}^{n} p(x(i)) \log_2 p(x(i)).    (1)

Entropy is interpreted as the expected value of the negative of the logarithm
of the mass probability. Let x and y be two discrete random variables. The
joint entropy of x and y, with joint mass probability p(x(i), y(j)), is the
sum of the uncertainty contained by the two variables. Formally, joint
entropy is defined as follows:

    H(\{x,y\}) = -\sum_{i=1}^{n} \sum_{j=1}^{n} p(x(i),y(j)) \log_2 p(x(i),y(j)).    (2)

The joint entropy has values in the range

    \max(H(x), H(y)) \le H(\{x,y\}) \le H(x) + H(y).    (3)

The maximum value in inequality (3) is attained when x and y are completely
independent. The minimum value occurs when x is completely dependent on y.
The conditional entropy measures the remaining uncertainty of the random
variable x when the value of the random variable y is known. The minimum
value of the conditional entropy is zero, and it is attained when x is
completely dependent on y, i.e., there is no uncertainty in x if we know y.
The maximum value is attained when x and y are statistically independent,
i.e., the variable y does not add information to reduce the uncertainty of x.
Formally, the conditional entropy is defined as:

    H(x|y) = \sum_{j=1}^{n} p(y(j)) H(x|y = y(j)),    (4)

where

    0 \le H(x|y) \le H(x),    (5)

and H(x|y = y(j)) is the entropy of all x(i) that are associated with
y = y(j). Another way of representing the conditional entropy is:

    H(x|y) = H(\{x,y\}) - H(y).    (6)
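As a minimal sketch of eqs. (1)-(6), assuming that the mass probabilities are replaced by relative frequencies counted from samples (a plug-in estimate), the quantities above can be computed as follows. The function names and the toy samples are illustrative assumptions, not part of the original text.

```python
from collections import Counter
from math import log2

def entropy(*columns):
    """Plug-in estimate of the (joint) entropy, in bits, of the sample columns."""
    rows = list(zip(*columns))          # one tuple per sample
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in Counter(rows).values())

def conditional_entropy(x, y):
    """H(x|y) = H({x, y}) - H(y), as in eq. (6)."""
    return entropy(x, y) - entropy(y)

# Hypothetical toy samples: y is a noisy copy of x.
x = [0, 0, 1, 1, 0, 1, 0, 1]
y = [0, 0, 1, 1, 0, 1, 1, 0]

h_x, h_y, h_xy = entropy(x), entropy(y), entropy(x, y)
h_x_given_y = conditional_entropy(x, y)
print(h_x, h_y, h_xy, h_x_given_y)
```

On these samples the estimates respect the ranges stated in inequalities (3) and (5).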
2.2.2 Mutual Information


The mutual information (MI) is a measure of the amount of information that
one random variable has about another variable [?]. This definition is useful
within the context of feature selection because it gives a way to quantify
the relevance of a feature subset with respect to the output vector C.
Formally, the MI is defined as follows:

    I(x;y) = \sum_{i=1}^{n} \sum_{j=1}^{n} p(x(i),y(j)) \log \left( \frac{p(x(i),y(j))}{p(x(i)) \, p(y(j))} \right),    (7)

where MI is zero when x and y are statistically independent, i.e.,
p(x(i), y(j)) = p(x(i)) p(y(j)). The MI is related linearly to the entropies
of the variables through the following equations:

    I(x;y) = \begin{cases}
               H(x) - H(x|y) \\
               H(y) - H(y|x) \\
               H(x) + H(y) - H(x,y).
             \end{cases}    (8)

Fig. 1 shows a Venn diagram with the relationships described in (8).
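The following sketch evaluates eq. (7) directly from sample counts and compares it with the third row of eq. (8). It assumes plug-in probability estimates and base-2 logarithms; the function names and toy data are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(*columns):
    """Plug-in estimate of the (joint) entropy, in bits."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in Counter(rows).values())

def mutual_information(x, y):
    """Plug-in estimate of I(x; y) in bits, following eq. (7)."""
    n = len(x)
    n_xy, n_x, n_y = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(
        (k / n) * log2((k / n) / ((n_x[a] / n) * (n_y[b] / n)))
        for (a, b), k in n_xy.items()
    )

# Hypothetical toy samples: y is a noisy copy of x.
x = [0, 0, 1, 1, 0, 1, 0, 1]
y = [0, 0, 1, 1, 0, 1, 1, 0]

mi_direct = mutual_information(x, y)
mi_entropies = entropy(x) + entropy(y) - entropy(x, y)   # eq. (8), third row
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(mi_direct, mi_entropies, mi_independent)
```

Both routes give the same estimate, and the independent pair yields zero MI, as stated after eq. (7).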


Let z be a discrete random variable. Its interaction with the other two
variables {x, y} can be measured by the conditional MI, which is defined as
follows:

    I(x;y|z) = \sum_{i=1}^{n} p(z(i)) I(x;y|z = z(i)),    (9)

where I(x;y|z = z(i)) is the MI between x and y in the context of z = z(i).
The conditional MI allows measuring the information of two variables in the
context of a third one, but it does not measure the information among the
three variables. Multi-information is an interesting extension of MI,
proposed by McGill, which allows measuring the interaction among more than
two variables. For the case of three variables, the multi-information is
defined as follows:

    I(x;y;z) = I(\{x,y\};z) - I(x;z) - I(y;z)
             = I(y;z|x) - I(y;z).    (10)

The multi-information is symmetrical, i.e., I(x;y;z) = I(x;z;y) = I(z;y;x)
= I(y;x;z) = ... The multi-information has not been widely used in the
literature, due to its difficult interpretation, e.g. the multi-information
can take negative values, among other reasons. However, there are some
interesting papers about the interaction among variables that use this
concept [42, 68, 30, ?]. The multi-information can be understood as the
amount of information common to all variables (or set of variables), but that
is not present in any subset of these variables. To better understand the
concept of multi-information within the context of feature selection, let us
consider the following example.
[Figure: Venn diagram with regions labeled H(X,Y), H(X) and H(Y)]

Fig. 1 Venn diagram showing the relations between MI and entropies


Example 1 Let x_1, x_2, x_3 be independent binary random variables. The
output of a given system is built through the function
C = x_1 + (x_2 ⊕ x_3), and x_4 = x_1, where + stands for the OR logic
function and ⊕ represents the XOR logic function.

Using eq. (10) to measure the multi-information among x_2, x_3 and C gives:
I(x_2; x_3; C) = I({x_2, x_3}; C) − I(x_2; C) − I(x_3; C). Notice that the
relevance of the single features x_2 and x_3 with respect to C is null, since
I(x_2; C) = I(x_3; C) = 0, but the joint information of {x_2, x_3} with
respect to C is greater than zero, I({x_2, x_3}; C) > 0. In this case, x_2
and x_3 interact positively to predict C, and this yields a positive value of
the multi-information among these variables. The multi-information among the
variables x_1, x_4 and C is given by:
I(x_1; x_4; C) = I({x_1, x_4}; C) − I(x_1; C) − I(x_4; C). The relevance of
the individual features x_1 and x_4 is the same, i.e.,
I(x_1; C) = I(x_4; C) > 0. In this case the joint information provided by x_1
and x_4 with respect to C is the same as that of each variable acting
separately, i.e., I({x_1, x_4}; C) = I(x_1; C) = I(x_4; C). This yields a
negative value of the multi-information among these variables. We can deduce
that the interaction between x_1 and x_4 does not provide any new information
about C. Let us consider now the multi-information among x_1, x_2 and C,
which is zero: I(x_1; x_2; C) = I({x_1, x_2}; C) − I(x_1; C) − I(x_2; C) = 0.
Since feature x_2 only provides information about C when interacting with
x_3, then I({x_1, x_2}; C) = I(x_1; C). In this case, features x_1 and x_2 do
not interact in the knowledge of C.
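The three cases of Example 1 can be checked numerically. The sketch below enumerates the truth table of the three inputs, which assumes that they are uniformly distributed as well as independent (the example itself only states independence), and evaluates the first form of eq. (10). All function names are illustrative assumptions.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(*columns):
    """Plug-in estimate of the (joint) entropy, in bits."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in Counter(rows).values())

def mi(features, c):
    """I(features; c) for a list of feature columns, via eq. (8)."""
    return entropy(*features) + entropy(c) - entropy(*features, c)

def multi_information(a, b, c):
    """Three-variable multi-information, first form of eq. (10)."""
    return mi([a, b], c) - mi([a], c) - mi([b], c)

# Truth table of the three independent inputs (assumed uniform).
table = list(product([0, 1], repeat=3))
x1 = [r[0] for r in table]
x2 = [r[1] for r in table]
x3 = [r[2] for r in table]
x4 = list(x1)                            # x4 = x1
C = [a | (b ^ c) for a, b, c in table]   # C = x1 OR (x2 XOR x3)

i_23 = multi_information(x2, x3, C)   # complementary pair: positive
i_14 = multi_information(x1, x4, C)   # redundant pair: negative
i_12 = multi_information(x1, x2, C)   # no interaction: zero
print(i_23, i_14, i_12)
```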

From the viewpoint of feature selection, the value of the multi-information
(positive, negative or zero) gives rich information about the kind of
interaction there is among the variables. Let us consider the case where we
have a set of already selected features S and a candidate feature f_i, and we
measure the multi-information of these variables with the class variable C,
I(f_i; S; C) = I(S; C|f_i) − I(S; C). When the multi-information is positive,
it means that feature f_i and S are complementary. On the other hand, when
the multi-information is negative, it means that by adding f_i we are
diminishing the dependence between S and C, because f_i and S are redundant.
Finally, when the multi-information is zero, it means that f_i is irrelevant
with respect to the dependency between S and C.

The mutual information between a set of m features and the class variable
C can be expressed compactly in terms of multi-information as follows:

    I(\{x_1, x_2, \ldots, x_m\}; C) = \sum_{k=1}^{m} \sum_{\substack{S \subseteq \{x_1, \ldots, x_m\} \\ |S| = k}} I([S \cup C]),    (11)

where I([S ∪ C]) = I(s_1; s_2; ...; s_k; C). Note that the sum on the right
side of eq. (11) is taken over all subsets S of size k drawn from the set
{x_1, ..., x_m}.
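As a worked instance of eq. (11), consider m = 2. The subsets of size k = 1 contribute the two individual MI terms and the single subset of size k = 2 contributes the three-variable multi-information, so eq. (11) reduces to a rearrangement of eq. (10) with z = C:

```latex
% m = 2 instance of eq. (11); the k = 2 term is the multi-information of eq. (10)
I(\{x_1, x_2\}; C)
  = \underbrace{I(x_1; C) + I(x_2; C)}_{k = 1}
  + \underbrace{I(x_1; x_2; C)}_{k = 2}
```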


3 Relevance, Redundancy and Complementarity 

The filter approach to feature selection is based on the idea of relevance,
which we will explore in more detail in this section. Basically the problem
is to find the feature subset of minimum cardinality that preserves the
information contained in the whole set of features with respect to C. This
problem is usually solved by finding the relevant features and discarding
redundant and irrelevant features. In this section, we review the different
definitions of relevance, redundancy and complementarity found in the
literature.

3.1 Relevance


Intuitively, a given feature is relevant when, either individually or
together with other variables, it provides information about C. In the
literature there are many definitions of relevance, including different
levels of relevance [?]. Kohavi and John [31] used a probabilistic framework
to define three levels of relevance: strongly relevant, weakly relevant, and
irrelevant features, as shown in Table 1. Strongly relevant features provide
unique information about C, i.e., they cannot be replaced by other features.
Weakly relevant features provide information about C, but they can be
replaced by other features without losing information about C. Irrelevant
features do not provide information about C, and they can be discarded
without losing information. A drawback of the probabilistic approach is the
need to test conditional independence for all possible feature subsets, and
to estimate the probability density functions (pdfs) [?].

An alternative definition of relevance is given under the framework of
mutual information [?]. An advantage of this approach is that there are
several good methods for estimating MI. The last column of Table 1 shows how
the three levels of individual relevance are defined in terms of MI.


Table 1 Levels of relevance for candidate feature f_i, according to the
probabilistic framework [31] and the mutual information framework [43]

Relevance level     Condition     Probabilistic approach        Mutual information approach
------------------  ------------  ----------------------------  ---------------------------
Strongly relevant   -             p(C|f_i, ¬f_i) ≠ p(C|¬f_i)    I(f_i; C|¬f_i) > 0
Weakly relevant     ∃ S ⊂ ¬f_i    p(C|f_i, ¬f_i) = p(C|¬f_i)    I(f_i; C|¬f_i) = 0
                                  ∧ p(C|f_i, S) ≠ p(C|S)        ∧ I(f_i; C|S) > 0
Irrelevant          ∀ S ⊆ ¬f_i    p(C|f_i, S) = p(C|S)          I(f_i; C|S) = 0


The definitions shown in Table 1 give rise to several drawbacks, which are
summarized as follows:

1. To classify a given feature f_i as irrelevant, it is necessary to assess
   all possible subsets S of ¬f_i. Therefore this procedure is subject to the
   curse of dimensionality [3, ?].

2. The definition of strongly relevant features is too restrictive. If two
   features provide information about the class but are redundant, then both
   features will be discarded by this criterion. For example, let
   {x_1, x_2, x_3} be a set of 3 variables, where x_1 = x_2, x_3 is noise,
   and the output class is defined as C = x_1. Following the strong relevance
   criterion we have
   I(x_1; C|{x_2, x_3}) = I(x_2; C|{x_1, x_3}) = I(x_3; C|{x_1, x_2}) = 0.

3. The definition of weak relevance is not enough for deciding whether to
   discard a feature from the optimal feature set. It is necessary to
   discriminate between redundant and non-redundant features.
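The example in item 2 can be reproduced numerically. The sketch below assumes uniform binary x_1 and x_3, enumerated exhaustively, and uses plug-in entropies; the helper names are illustrative assumptions.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(*columns):
    """Plug-in estimate of the (joint) entropy, in bits; 0 for no columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in Counter(rows).values())

def cmi(x, c, given):
    """I(x; c | given) = H(x, Z) + H(c, Z) - H(x, c, Z) - H(Z)."""
    return (entropy(x, *given) + entropy(c, *given)
            - entropy(x, c, *given) - entropy(*given))

# x1 and x3 run over all binary combinations; x2 duplicates x1.
table = list(product([0, 1], repeat=2))
x1 = [r[0] for r in table]
x2 = list(x1)                 # redundant copy of x1
x3 = [r[1] for r in table]    # noise, independent of C
C = list(x1)                  # C = x1

strong_x1 = cmi(x1, C, [x2, x3])
strong_x2 = cmi(x2, C, [x1, x3])
strong_x3 = cmi(x3, C, [x1, x2])
marginal_x1 = cmi(x1, C, [])  # empty conditioning set: reduces to I(x1; C)
print(strong_x1, strong_x2, strong_x3, marginal_x1)
```

All three strong-relevance terms vanish even though x_1 alone carries one full bit about C, which is the over-restrictiveness described in item 2.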


3.2 Redundancy 

Yu and Liu proposed a finer classification of features into weakly relevant
but redundant and weakly relevant but non-redundant. Moreover, the authors
defined the set of optimal features as the one composed of strongly relevant
features and weakly relevant but non-redundant features. The concept of
redundancy is associated with the level of dependency among two or more
features. In principle we can measure the dependency of a given feature f_i
with respect to a feature subset S ⊂ ¬f_i by simply using the MI, I(f_i; S).
This information theoretic measure of redundancy satisfies the following
properties: it is symmetric, non-linear, non-negative, and does not diminish
when adding new features [43]. However, using this measure it is not possible
to determine concretely with which features of S the feature f_i is
redundant. This calls for more elaborate criteria of redundancy, such as the
Markov blanket [?] and total correlation [?]. The Markov blanket is a strong
condition for conditional independence, and is defined as follows.

Definition 1 (Markov blanket) Given a feature f_i, the subset M ⊂ ¬f_i is
a Markov blanket of f_i iff [?]:

    p(\{F \setminus \{f_i, M\}, C\} \mid \{f_i, M\}) = p(\{F \setminus \{f_i, M\}, C\} \mid M).    (12)

This condition requires that M subsumes all the information that f_i has
about C, but also about all other features {F \ {f_i, M}}. It can be proved
that strongly relevant features do not have a Markov blanket.

The Markov blanket condition given by Eq. (12) can be rewritten in the
context of information theory as follows [43]:

    I(f_i; \{C, \neg\{f_i, M\}\} \mid M) = 0.    (13)

An alternative measure of redundancy is the total correlation or
multivariate correlation [?]. Given a set of features F = {f_1, ..., f_m},
the total correlation is defined as follows:

    C(f_1; \ldots; f_m) = \sum_{i=1}^{m} H(f_i) - H(f_1, \ldots, f_m).    (14)

Total correlation measures the common information (redundancy) among all
the variables in F. If we want to measure the redundancy between a given
variable f_i and any feature subset S ⊂ ¬f_i, then we can use the total
correlation as:

    C(f_i; S) = H(f_i) + H(S) - H(f_i, S),    (15)

however this corresponds to the classic definition of MI, i.e.,
C(f_i; S) = I(f_i; S).
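A minimal sketch of eq. (14), assuming plug-in entropy estimates on an exhaustively enumerated toy distribution. The variable names and the toy features (f_2 duplicating f_1, f_3 independent) are illustrative assumptions.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(*columns):
    """Plug-in estimate of the (joint) entropy, in bits."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in Counter(rows).values())

def total_correlation(*columns):
    """C(f_1; ...; f_m) = sum_i H(f_i) - H(f_1, ..., f_m), eq. (14)."""
    return sum(entropy(col) for col in columns) - entropy(*columns)

table = list(product([0, 1], repeat=2))
f1 = [r[0] for r in table]
f2 = list(f1)                 # exact duplicate of f1
f3 = [r[1] for r in table]    # independent of f1 and f2

tc_all = total_correlation(f1, f2, f3)    # one shared bit between f1 and f2
tc_indep = total_correlation(f1, f3)      # independent pair
mi_12 = entropy(f1) + entropy(f2) - entropy(f1, f2)
tc_12 = total_correlation(f1, f2)         # two-variable case, as in eq. (15)
print(tc_all, tc_indep, tc_12, mi_12)
```

For two variables the total correlation coincides with the MI, as noted after eq. (15).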
3.3 Complementarity

The concept of complementarity has been re-discovered several times [?].
Recently, it has become more relevant because of the development of more
efficient techniques to estimate MI in high-dimensional spaces [34, 27].
Complementarity, also known as synergy, measures the degree of interaction
between an individual feature f_i and a feature subset S given C, through the
expression I(f_i; S|C). To illustrate the concept of complementarity, we will
start by expanding the multi-information among f_i, C and S. Decomposing the
multi-information into its three possible expressions we have:

    I(f_i; S; C) = \begin{cases}
                     I(f_i; S|C) - I(f_i; S) \\
                     I(f_i; C|S) - I(f_i; C) \\
                     I(S; C|f_i) - I(S; C).
                   \end{cases}    (16)

According to eq. (16), the first row shows that the multi-information can be
expressed as the difference between complementarity (I(f_i; S|C)) and
redundancy (I(f_i; S)). A positive value of the multi-information entails a
dominance of complementarity over redundancy. Analyzing the second row of
eq. (16), we observe that this expression becomes positive when the
information that f_i has about C is greater when it interacts with subset S
than when it does not. This effect is called complementarity. The third row
of eq. (16) gives us another viewpoint of the complementarity effect. The
multi-information is positive when the information that S has about C is
greater when it interacts with feature f_i compared to the case when it does
not interact. Assuming that the complementarity effect is dominant over
redundancy, Fig. 2 illustrates a Venn diagram with the relationships among
complementarity, redundancy and relevancy.
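The second row of eq. (16) can be obtained from the first form of eq. (10) in one step. The short derivation below is a sketch that assumes only the chain rule of MI, I({f_i, S}; C) = I(S; C) + I(f_i; C|S):

```latex
\begin{align*}
I(f_i; S; C) &= I(\{f_i, S\}; C) - I(f_i; C) - I(S; C) \\
             &= \bigl[ I(S; C) + I(f_i; C|S) \bigr] - I(f_i; C) - I(S; C) \\
             &= I(f_i; C|S) - I(f_i; C).
\end{align*}
```

Exchanging the roles of f_i and S in the chain rule gives the third row in the same way.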


4 Optimal Feature Subset 

In this section we review the different definitions of the optimal feature
subset, S_opt, given in the literature, as well as the search strategies used
for obtaining this optimal set. According to [?], in practice the feature
selection problem must include a classifier or an ensemble of classifiers,
and a performance metric. The optimal feature subset is defined as the one
that maximizes the performance metric while having minimum cardinality.
However, filter methods are independent of both the learning machine and the
performance metric. Any filter method corresponds to a definition of
relevance that employs only the data distribution [?]. Yu and Liu defined the
optimal feature set as composed of all strongly relevant features and the
weakly relevant but not redundant features. In this section we review the
definitions of the optimal feature subset from the viewpoint of filter
methods, in particular MI feature selection methods. The key notion is
conditional independence, which allows defining the sufficient feature subset
as follows [?]:
[Figure: Venn diagram with regions labeled Individual Relevance and Subset
Relevance]

Fig. 2 Venn diagram showing the relationships among complementarity,
redundancy and relevancy, assuming that the multi-information among f_i, S
and C is positive.


Definition 2 S ⊆ F is a sufficient feature subset iff

    p(C|F) = p(C|S).    (17)

This definition implies that C and ¬S are conditionally independent given S,
i.e., ¬S provides no additional information about C in the context of S.
However, we still need a search strategy to select the feature subset S, and
an exhaustive search using this criterion is impractical due to the curse of
dimensionality.

In probabilistic terms, the measure of sufficient feature subset can be
expressed as the expected value over p(F) of the Kullback-Leibler divergence
between p(C|F) and p(C|S) [?]. According to Guyon et al. [?], this can be
expressed in terms of MI as follows:

    DMI(S) = I(F; C) - I(S; C).    (18)

Guyon et al. proposed solving the following optimization problem:

    \min_{S \subseteq F} |S| + \lambda \cdot DMI(S),    (19)

where λ > 0 represents the Lagrange multiplier. If S is a sufficient feature
subset, then DMI(S) = 0, and eq. (19) is reduced to min_{S ⊆ F} |S|. Since
I(F; C) is constant, eq. (19) is equivalent to:

    \min_{S \subseteq F} |S| - \lambda \cdot I(S; C).    (20)

The feature selection problem corresponds to finding the smallest feature
subset that maximizes I(S; C). Since the term min_{S ⊆ F} |S| is discrete,
the optimization of (20) is difficult. Tishby et al. proposed replacing the
term min_{S ⊆ F} |S| with I(F; S).
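For a handful of features, objective (20) can be minimized by brute force, which makes the trade-off between cardinality and information concrete. The sketch below reuses the data of Example 1 (inputs assumed uniform), takes a hypothetical multiplier λ = 100 chosen only for illustration, and relies on plug-in MI estimates. It is not a practical algorithm, since the number of subsets grows as 2^m.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def entropy(*columns):
    """Plug-in estimate of the (joint) entropy, in bits; 0 for no columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in Counter(rows).values())

def mi(features, c):
    """I(features; c) for a list of feature columns."""
    return entropy(*features) + entropy(c) - entropy(*features, c)

# Example 1 data: C = x1 OR (x2 XOR x3), x4 = x1.
table = list(product([0, 1], repeat=3))
x1 = [r[0] for r in table]
x2 = [r[1] for r in table]
x3 = [r[2] for r in table]
x4 = list(x1)
C = [a | (b ^ c) for a, b, c in table]
features = [x1, x2, x3, x4]

lam = 100.0  # hypothetical Lagrange multiplier, chosen for illustration

def objective(subset):
    """|S| - lambda * I(S; C), the quantity minimized in eq. (20)."""
    return len(subset) - lam * mi([features[i] for i in subset], C)

all_subsets = [s for k in range(len(features) + 1)
               for s in combinations(range(len(features)), k)]
best = min(all_subsets, key=objective)
print(best, mi([features[i] for i in best], C), entropy(C))
```

With this large λ the minimizer is a three-feature subset that contains x_2 and x_3 plus one of the two copies x_1 or x_4, and it retains all the information about C.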

An alternative approach to optimal feature subset selection is using the
concept of the Markov blanket (MB). Remember that the Markov blanket, M, of
a target variable C, is the smallest subset of F such that C is independent
of the rest of the variables F \ M. Koller and Sahami proposed using MBs as
the basis for feature elimination. They proved that features eliminated
sequentially based on this criterion remain unnecessary. However, the time
needed for inducing an MB grows exponentially with the size of this set, when
considering full dependencies. Therefore most MB algorithms implement
approximations based on heuristics, e.g. finding the set of k features that
are strongly correlated with a given feature [33]. Fast MB discovery
algorithms have been developed for the case of distributions that are
faithful to a Bayesian network [?, 53]. However, these algorithms require
that the optimal feature subset does not contain multivariate associations
among variables, which are individually irrelevant but become relevant in the
context of others [?]. In practice, this means for example that current MB
discovery algorithms cannot solve Example 1 due to the XOR function.

An important caveat is that both feature selection approaches, sufficient
feature subset and MBs, are based on estimating the probability distribution
of C given the data. Estimating posterior probabilities is a harder problem
than classification, e.g. when using a 0/1-loss function only the most
probable classification is needed. Therefore, this effect may render some
features contained in the sufficient feature subset or in the MB of C
unnecessary [?, ?, 24].


4.1 Relation between MI and Bayes error classification 


There are some interesting results relating the MI between a random discrete
variable f and a random discrete target variable C, with the minimum error
obtained by the maximum a posteriori classifier (Bayes classification error)
[20]. The Bayes error is bounded above and below according to the following
expression:

    1 - \frac{I(f;C) + \log(2)}{\log(|C|)} \le e_{bayes}(f) \le \frac{1}{2} \bigl( H(C) - I(f;C) \bigr).    (21)

Interestingly, Eq. (21) shows that both limits are minimized when the MI,
I(f; C), is maximized.
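The bounds of eq. (21) can be evaluated on a small synthetic joint distribution. The sketch below assumes four equiprobable classes (so that H(C) = log|C|), a feature f that equals the class with a hypothetical probability of 0.7 and otherwise takes one of the other three values uniformly, and base-2 logarithms throughout. These modeling choices are illustrative assumptions only.

```python
from math import log2

n_classes = 4
p_correct = 0.7  # hypothetical probability that f coincides with the class

# Joint mass p(f, c) with a uniform class prior and a symmetric noisy feature.
joint = {
    (f, c): (1 / n_classes)
    * (p_correct if f == c else (1 - p_correct) / (n_classes - 1))
    for f in range(n_classes)
    for c in range(n_classes)
}
p_f = {f: sum(joint[f, c] for c in range(n_classes)) for f in range(n_classes)}
p_c = {c: sum(joint[f, c] for f in range(n_classes)) for c in range(n_classes)}

h_c = -sum(p * log2(p) for p in p_c.values())
mi = sum(p * log2(p / (p_f[f] * p_c[c])) for (f, c), p in joint.items())

# Bayes (maximum a posteriori) error: 1 - sum_f max_c p(f, c).
bayes_error = 1 - sum(
    max(joint[f, c] for c in range(n_classes)) for f in range(n_classes)
)

lower = 1 - (mi + log2(2)) / log2(n_classes)   # left side of eq. (21)
upper = 0.5 * (h_c - mi)                       # right side of eq. (21)
print(lower, bayes_error, upper)
```

In this setting the Bayes error lies between the two bounds, and increasing p_correct raises I(f; C) and lowers both of them.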
4.2 Search strategies

According to Guyon et al. [?], a feature selection method has three
components: 1) evaluation criterion definition, e.g. relevance for filter
methods, 2) evaluation criterion estimation, e.g. sufficient feature
selection or MB for filter methods, and 3) search strategies for feature
subset generation. In this section, we briefly review the main search
strategies used by MI feature selection methods. Given a feature set F of
cardinality m, there are 2^m possible subsets, therefore an exhaustive search
is impractical for high-dimensional datasets.

There are two basic search strategies: optimal methods and sub-optimal
methods [?]. Optimal search strategies include exhaustive search and
accelerated methods based on the monotonic property of a feature selection
criterion, such as branch and bound. But optimal methods are impractical for
high-dimensional datasets, therefore sub-optimal strategies must be used.

The most popular search methods are sequential forward selection (SFS) and
sequential backward elimination (SBE) [?]. Sequential forward selection is a
bottom-up search, which starts with an empty set, and adds new features one
at a time. Formally, it adds the candidate feature f_i that maximizes I(S; C)
to the subset of selected features S, i.e.,

    S = S \cup \{ \arg\max_{f_i \in \neg S} I(\{S, f_i\}; C) \}.    (22)

Sequential backward elimination is a top-down approach, which starts with
the whole set of features, and deletes one feature at a time. Formally, it
starts with S = F, and proceeds by deleting the least informative feature one
at a time, i.e., the feature whose removal preserves the most information
about C:

    S = S \setminus \{ \arg\max_{f_i \in S} I(\{S \setminus f_i\}; C) \}.    (23)

Usually backward elimination is computationally more expensive than forward
selection, e.g. when searching for a small subset of features. However,
backward elimination can usually find better feature subsets, because most
forward selection methods do not take into account the relevance of variables
in the context of features not yet included in the subset of selected
features [?]. Both kinds of search methods suffer from the nesting effect,
meaning that in forward selection a variable cannot be deleted from the
feature set once it has been added, and in backward elimination a variable
cannot be reincorporated once it has been deleted. Instead of adding a single
feature at a time, some generalized forward selection variants add several
features, to take into account the statistical relationship between variables
[?]. Likewise, generalized backward elimination deletes several variables at
a time. An enhancement may be obtained by combining forward and backward
selection, avoiding the nesting effect. The strategy "plus-l-take-away-r"
adds l features to S and then removes the worst r features if l > r, or
deletes r features and then adds l features if l < r.
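The two greedy strategies can be sketched as follows, under the assumption that I(S; C) is estimated with plug-in entropies and that ties are broken in favor of the lowest feature index (scores are rounded before comparison so that floating-point noise does not decide ties). The backward step removes the feature whose removal leaves I(S \ f_i; C) largest. Function names and the Example 1 data are illustrative assumptions.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(*columns):
    """Plug-in estimate of the (joint) entropy, in bits; 0 for no columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in Counter(rows).values())

def mi(features, c):
    """I(features; c), rounded so that exact ties compare equal."""
    return round(entropy(*features) + entropy(c) - entropy(*features, c), 12)

def sfs(features, c, k):
    """Sequential forward selection: add the best feature until |S| = k."""
    selected = []
    while len(selected) < k:
        remaining = [i for i in range(len(features)) if i not in selected]
        best = max(remaining,
                   key=lambda i: mi([features[j] for j in selected + [i]], c))
        selected.append(best)
    return selected

def sbe(features, c, k):
    """Sequential backward elimination: drop features until |S| = k."""
    selected = list(range(len(features)))
    while len(selected) > k:
        # Drop the feature whose removal keeps the most information about c.
        drop = max(selected,
                   key=lambda i: mi([features[j] for j in selected if j != i], c))
        selected.remove(drop)
    return selected

# Example 1 data: C = x1 OR (x2 XOR x3), x4 = x1 (inputs assumed uniform).
table = list(product([0, 1], repeat=3))
x1 = [r[0] for r in table]
x2 = [r[1] for r in table]
x3 = [r[2] for r in table]
x4 = list(x1)
C = [a | (b ^ c) for a, b, c in table]
features = [x1, x2, x3, x4]

forward = sfs(features, C, 3)
backward = sbe(features, C, 3)
print(forward, backward)
```

With only two features allowed, forward selection on this data would be stuck after picking x_1, because x_2 and x_3 are individually uninformative; this is the limitation of forward selection mentioned above.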
5 A Unified Framework for Mutual Information Feature Selection

Many MI feature selection methods have been proposed in the last 20 years. 
Most methods define heuristic functionals to assess feature subsets combining 
definitions of relevant and redundant features. Brown et al. 0 proposed a 
unifying framework for information theoretic feature selection methods. The 
authors posed the feature selection problem as a conditional likelihood of the 
class labels, given features. Under the filter assumption [l3|, conditional like¬ 
lihood is equivalent to conditional mutual information (CMI), i.e., the feature 
selection problem can be posed as follows: 

min I S'I (24) 

SCF 

subjectto : mn/(-iS; CjS). 

This corresponds to the smallest feature subset such that the CMI is minimal. 
Starting from this objective function, the authors used MI properties to deduce 
some common heuristic criteria used for MI feature selection. Several criteria 
can be unified under the proposed framework. In particular, they showed that 
common heuristics based on linear combinations of information terrns, such 
as Battiti’s MIPS [^, conditional infomax feature extraction (CIFE) (iJ[^. 
minimum-redundancy maximum relevance (mRMR) (46| . and joint mutual 
information (JMI) [66|, are all low-order approximations to the conditional 
likelihood optimization problem. However, the unifying framework proposed 
by Brown et al. 0 fell short of deriving (explaining) non-linear criteria using 
min or max operators such as Conditional Mutual Information Maximization 
(CMIM) [U, Informative Fragments [fi^, and ICAP (2^ . 

Let us start with the assumption that $I(F; C)$ measures all the information
about the target variable contained in the set of features. This assumption is
based on the additivity property of MI [13], which states that the information
about a given system is maximal when all features ($F$) are used to estimate
the target variable ($C$). Using the chain rule, $I(F; C)$ can be decomposed
as follows:

$$I(F; C) = I(S; C) + I(\neg S; C \mid S). \tag{25}$$

As $I(F; C)$ is constant, maximizing $I(S; C)$ is equivalent to minimizing
$I(\neg S; C \mid S)$. Many MI feature selection methods maximize the first
term on the right hand side of eq. (25). This is known as the criterion of
maximal dependency (MD) [46]. On the other hand, other criteria are based on
the idea of minimizing the CMI, i.e., the second term on the right hand side
of eq. (25).
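
The decomposition in eq. (25) can be checked numerically on a small discrete example. The sketch below is only illustrative: the helper names `entropy` and `cond_mi`, and the toy joint distribution (two fair bits and their XOR as the class), are assumptions introduced here and do not come from the reviewed methods.

```python
from collections import defaultdict
from math import isclose, log2


def entropy(joint, idx):
    """Shannon entropy (bits) of the variables at positions `idx`."""
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)


def cond_mi(joint, a, b, z=()):
    """I(A; B | Z) = H(A,Z) + H(B,Z) - H(A,B,Z) - H(Z)."""
    a, b, z = tuple(a), tuple(b), tuple(z)
    return (entropy(joint, a + z) + entropy(joint, b + z)
            - entropy(joint, a + b + z) - entropy(joint, z))


# Toy joint distribution over (f1, f2, C): two fair bits and C = f1 XOR f2.
joint = {(f1, f2, f1 ^ f2): 0.25 for f1 in (0, 1) for f2 in (0, 1)}
F, S, not_S, C = (0, 1), (0,), (1,), (2,)

total = cond_mi(joint, F, C)            # I(F; C)
dependency = cond_mi(joint, S, C)       # I(S; C), the MD term
residual = cond_mi(joint, not_S, C, S)  # I(not S; C | S), the CMI term

print(total, dependency, residual)
assert isclose(total, dependency + residual)
```

In this XOR example the subset S = {f1} alone carries no information about C, so all of I(F; C) sits in the conditional term. Whatever subset is chosen, the two terms always add up to the same constant.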

In the following we describe the approach of Brown et al. for deriving
sequential forward selection and sequential backward elimination algorithms,
which are based on minimizing the CMI. For the convenience of the reader,
we present the equivalent procedure in parallel when maximizing dependency
(MD). In practice, a search strategy is needed to find the best feature subset.
As we saw in the previous section, the most popular methods are sequential
forward selection and sequential backward elimination. Before proceeding we
need to define some notation.
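
Table 2 Parallel between MD and CMI approaches for sequential forward selection

MD:

$$\begin{aligned}
\max_{f_i \in \neg S^t} I(S^{t+1}; C)
&= \max_{f_i \in \neg S^t} I(\{S^t, f_i\}; C) \\
&= \max_{f_i \in \neg S^t} \underbrace{I(S^t; C)}_{(a)} + \max_{f_i \in \neg S^t} I(f_i; C \mid S^t) \\
&\Downarrow \\
&\max_{f_i \in \neg S^t} I(f_i; C \mid S^t)
\end{aligned}$$

CMI:

$$\begin{aligned}
\min_{f_i \in \neg S^t} I(\neg S^{t+1}; C \mid S^{t+1})
&= \min_{f_i \in \neg S^t} I(\neg S^t \setminus f_i; C \mid \{S^t, f_i\}) \\
&= \min_{f_i \in \neg S^t} \underbrace{I(\neg S^t; C \mid S^t)}_{(b)} + \min_{f_i \in \neg S^t} \left(-I(f_i; C \mid S^t)\right) \\
&\Downarrow \\
&\max_{f_i \in \neg S^t} I(f_i; C \mid S^t)
\end{aligned}$$

(a) This term is independent of $f_i$.

(b) This term has the same value $\forall f_i$.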


$S^t$  Subset of selected variables at time $t$.

$f_i$  Candidate feature to be added to or eliminated from feature subset
$S^t$ at time $t$.

    $f_i = \arg\max_{f_i \in \neg S^t} I(f_i; C \mid S^t)$ in forward selection.

    $f_i = \arg\min_{f_i \in S^t} I(f_i; C \mid S^t \setminus f_i)$ in backward elimination.

$s_j$  A given feature in $S^t$.

$\neg s_j$  The complement set of feature $s_j$ with respect to set $S^t$,
i.e., $\neg s_j = S^t \setminus s_j$.

$S^{t+1}$  Subset of selected variables at time $t+1$.

    $S^{t+1} = \{S^t, f_i\}$ in forward selection.

    $S^{t+1} = \{S^t \setminus f_i\}$ in backward elimination.

$\neg S^{t+1}$  Complement of feature subset $S^{t+1}$, i.e.,
$F = \{S^{t+1}, \neg S^{t+1}\}$.

    $\neg S^{t+1} = \{\neg S^t \setminus f_i\}$ in forward selection.

    $\neg S^{t+1} = \{\neg S^t, f_i\}$ in backward elimination.

Table 2 shows that for the case of sequential forward selection, we achieve
the same result when using the MD or CMI approach: the SFS algorithm
consists of maximizing $I(f_i; C \mid S^t)$. Analogously, Table 3 shows that
for the case of sequential backward elimination, again we achieve the same
result when using MD or CMI approaches: the SBE algorithm consists of
minimizing $I(f_i; C \mid S^t \setminus f_i)$.
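
As an illustration of the resulting SFS rule, the sketch below greedily adds the candidate that maximizes I(f_i; C | S^t). It is a toy under stated assumptions: the features are discrete, the conditional MI is computed with a plug-in (empirical frequency) estimator, and the names `entropy`, `cond_mi` and `forward_selection` are hypothetical helpers introduced here. The plug-in estimate conditions on the whole selected set, so it is only usable for very small problems.

```python
from collections import Counter
from math import log2


def entropy(rows, cols):
    """Plug-in (empirical) entropy, in bits, of the columns `cols`."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return -sum(k / n * log2(k / n) for k in counts.values())


def cond_mi(rows, f, target, given):
    """Plug-in estimate of I(f; target | given) for discrete columns."""
    z = tuple(given)
    return (entropy(rows, (f,) + z) + entropy(rows, (target,) + z)
            - entropy(rows, (f, target) + z) - entropy(rows, z))


def forward_selection(rows, candidates, target, k):
    """Greedy SFS: at each step add the arg max of I(f_i; C | S^t)."""
    selected, remaining = [], list(candidates)
    for _ in range(k):
        best = max(remaining, key=lambda f: cond_mi(rows, f, target, selected))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: x0 takes values in {0, 1, 2}, x1 is a coarsened copy of x0,
# x2 is a fair bit, and the class is determined by (x0, x2).
# Column order: x0, x1, x2, class.
rows = [(x0, min(x0, 1), x2, 2 * x0 + x2) for x0 in range(3) for x2 in range(2)]
print(forward_selection(rows, candidates=[0, 1, 2], target=3, k=2))
```

Once x0 has been selected, its coarsened copy x1 adds nothing in the conditional sense, so the second pick goes to x2 even though x1 is individually relevant.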

For space limitations, we will develop here only the case of forward feature
selection, but the procedure is analogous for the case of backward feature
elimination. The expression $I(f_i; C \mid S^t)$ can be expanded as follows:

$$I(f_i; C \mid S^t) = I(f_i; C) - I(f_i; S^t) + I(f_i; S^t \mid C). \tag{26}$$

The first term on the right hand side of eq. (26) measures the individual
relevance of the candidate feature $f_i$ with respect to the output $C$; the
second term measures the redundance of the candidate feature with the subset
of previously selected features $S^t$; and the third term measures the
complementarity between $S^t$ and $f_i$ in the context of $C$. However, from
the practical point of view, eq. (26) presents the difficulty of estimating MI
in high-dimensional spaces, due to the presence of the set $S^t$ in the second
and third terms.
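
For readers who want the intermediate step, eq. (26) follows from applying the chain rule to $I(f_i; \{C, S^t\})$ in two different orders. The short derivation below is a sketch added for illustration; it uses only the chain rule already invoked for eq. (25).

```latex
\begin{align*}
I(f_i; \{C, S^t\}) &= I(f_i; S^t) + I(f_i; C \mid S^t)
  && \text{(expand on $S^t$ first)} \\
I(f_i; \{C, S^t\}) &= I(f_i; C) + I(f_i; S^t \mid C)
  && \text{(expand on $C$ first)} \\
\Rightarrow \quad I(f_i; C \mid S^t) &= I(f_i; C) - I(f_i; S^t) + I(f_i; S^t \mid C).
\end{align*}
```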
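
Table 3 Parallel between MD and CMI approaches for sequential backward elimination

MD:

$$\begin{aligned}
\max_{f_i \in S^t} I(S^{t+1}; C)
&= \max_{f_i \in S^t} I(S^t \setminus f_i; C) \\
&= \max_{f_i \in S^t} \underbrace{I(S^t; C)}_{(a)} + \max_{f_i \in S^t} \left(-I(f_i; C \mid S^t \setminus f_i)\right) \\
&\Downarrow \\
&\min_{f_i \in S^t} I(f_i; C \mid S^t \setminus f_i)
\end{aligned}$$

CMI:

$$\begin{aligned}
\min_{f_i \in S^t} I(\neg S^{t+1}; C \mid S^{t+1})
&= \min_{f_i \in S^t} I(\{\neg S^t, f_i\}; C \mid S^t \setminus f_i) \\
&= \min_{f_i \in S^t} \underbrace{I(\neg S^t; C \mid S^t)}_{(b)} + \min_{f_i \in S^t} I(f_i; C \mid S^t \setminus f_i) \\
&\Downarrow \\
&\min_{f_i \in S^t} I(f_i; C \mid S^t \setminus f_i)
\end{aligned}$$

(a) This term is independent of $f_i$.

(b) This term has the same value $\forall f_i$.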


In what follows, we take a detour from the derivation of Brown et al.,
using our own alternative approach. To avoid the previously mentioned
problem, $I(f_i; S^t)$ with $|S^t| = p$ can be calculated by averaging all
expansions over every single feature in $S^t$, using the chain rule as follows:


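$$\begin{aligned}
I(f_i; S^t) &= I(f_i; s_1) + I(f_i; \neg s_1 \mid s_1) \\
I(f_i; S^t) &= I(f_i; s_2) + I(f_i; \neg s_2 \mid s_2) \\
&\;\;\vdots \\
I(f_i; S^t) &= I(f_i; s_p) + I(f_i; \neg s_p \mid s_p) \\[4pt]
I(f_i; S^t) &= \frac{1}{|S^t|} \sum_{s_j \in S^t} I(f_i; s_j) + \frac{1}{|S^t|} \sum_{s_j \in S^t} I(f_i; \neg s_j \mid s_j)
\end{aligned} \tag{27}$$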

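Analogously, we can obtain the following expansion for the conditional mutual
information $I(f_i; S^t \mid C)$:

$$I(f_i; S^t \mid C) = \frac{1}{|S^t|} \sum_{s_j \in S^t} I(f_i; s_j \mid C) + \frac{1}{|S^t|} \sum_{s_j \in S^t} I(f_i; \neg s_j \mid \{C, s_j\}). \tag{28}$$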


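Substituting (27) and (28) into eq. (26) yields:

$$\begin{aligned}
I(f_i; C \mid S^t) = I(f_i; C)
&- \left( \frac{1}{|S^t|} \sum_{s_j \in S^t} I(f_i; s_j) + \frac{1}{|S^t|} \sum_{s_j \in S^t} I(f_i; \neg s_j \mid s_j) \right) \\
&+ \left( \frac{1}{|S^t|} \sum_{s_j \in S^t} I(f_i; s_j \mid C) + \frac{1}{|S^t|} \sum_{s_j \in S^t} I(f_i; \neg s_j \mid \{C, s_j\}) \right).
\end{aligned} \tag{29}$$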


Eq. (29) can be approximated by considering assumptions of lower-order
dependencies between features. Features $s_j \in S^t$ are assumed to have only
one-to-one dependencies with $f_i$ or $C$. Formally, assuming statistical
independence:


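$$\begin{aligned}
p(f_i \mid S^t) &= \prod_{s_j \in S^t} p(f_i \mid s_j), \\
p(f_i \mid \{S^t, C\}) &= \prod_{s_j \in S^t} p(f_i \mid \{s_j, C\}),
\end{aligned} \tag{30}$$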

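we obtain the following low-order approximation:

$$I(f_i; C \mid S^t) \approx I(f_i; C) - \frac{1}{|S^t|} \sum_{s_j \in S^t} I(f_i; s_j) + \frac{1}{|S^t|} \sum_{s_j \in S^t} I(f_i; s_j \mid C). \tag{31}$$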


Notice that eq. (31) is an approximation of the multidimensional MI expressed
by eq. (26). Interestingly, Brown et al. deduced a similar formula but with
coefficients $1/|S^t|$ replaced by unity constants.

Eq. (31) allows deriving some well-known heuristic feature selection methods.
When only the first two terms of eq. (31) are taken into account, it
corresponds exactly to the minimal redundance maximal relevance (mRMR)
criterion proposed in [46]. Moreover, if the term $1/|S^t|$ is replaced by a
user-defined parameter $\beta$, then we obtain the MIFS criterion (Mutual
Information Feature Selection) proposed by Battiti. When considering only the
first term in eq. (31) we obtain the MIM criterion.

Eq. (31) with its three terms corresponds exactly to the Joint Mutual
Information (JMI) criterion [66]. It also corresponds to the Conditional
Infomax Feature Extraction (CIFE) criterion when the coefficients $1/|S^t|$
are replaced by 1, $\forall t$. Moreover, the Conditional Mutual Information
based Feature Selection (CMIFS) criterion is an approximation of eq. (29),
where only 0, 1 or 2 out of $t$ summation terms are considered in each term.
The CMIFS criterion is the following:

$$J_{cmifs}(f_i) = I(f_i; C) - I(f_i; s_t) + \sum_{j} I(f_i; s_j \mid C) - I(f_i; s_t \mid s_1). \tag{32}$$
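
The special cases of eq. (31) discussed above (MIM, MIFS, mRMR, JMI and CIFE) differ only in the weights given to the redundancy and complementarity sums. The sketch below makes this explicit. It is a toy under stated assumptions: discrete features, a plug-in (empirical frequency) MI estimator, and hypothetical names (`low_order_score`, `criterion_weights`, and the weights `beta` and `gamma`) introduced only for this example.

```python
from collections import Counter
from math import log2


def entropy(rows, cols):
    """Plug-in (empirical) entropy, in bits, of the columns `cols`."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return -sum(k / n * log2(k / n) for k in counts.values())


def cond_mi(rows, a, b, given=()):
    """Plug-in estimate of I(a; b | given) for single columns a and b."""
    z = tuple(given)
    return (entropy(rows, (a,) + z) + entropy(rows, (b,) + z)
            - entropy(rows, (a, b) + z) - entropy(rows, z))


def low_order_score(rows, f, target, selected, beta, gamma):
    """I(f;C) - beta * sum_j I(f;s_j) + gamma * sum_j I(f;s_j|C)."""
    relevance = cond_mi(rows, f, target)
    redundancy = sum(cond_mi(rows, f, s) for s in selected)
    complementarity = sum(cond_mi(rows, f, s, (target,)) for s in selected)
    return relevance - beta * redundancy + gamma * complementarity


def criterion_weights(name, n_selected, beta=1.0):
    """Return the (beta, gamma) pair for each special case of eq. (31)."""
    inv = 1.0 / n_selected if n_selected else 0.0
    return {
        "MIM": (0.0, 0.0),    # first term only
        "MIFS": (beta, 0.0),  # user-defined redundancy weight
        "mRMR": (inv, 0.0),   # first two terms
        "JMI": (inv, inv),    # all three terms
        "CIFE": (1.0, 1.0),   # unit coefficients
    }[name]


# Toy data: x0 and x1 are independent fair bits, x2 duplicates x0, and the
# class encodes (x0, x1). Column order: x0, x1, x2, class.
rows = [(x0, x1, x0, 2 * x0 + x1) for x0 in (0, 1) for x1 in (0, 1)]
for name in ("MIM", "MIFS", "mRMR", "JMI", "CIFE"):
    b, g = criterion_weights(name, n_selected=1)
    scores = {f: low_order_score(rows, f, 3, [0], b, g) for f in (1, 2)}
    print(name, scores)
```

With x0 already selected, MIM scores the duplicate x2 and the new feature x1 identically, while the criteria that include the redundancy term penalize the duplicate.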


The previously mentioned methods do not take into account the terms
containing $\neg s_j$ in eq. (29). This entails the assumption that $f_i$ and
$\neg s_j$ are independent, therefore
$I(f_i; \neg s_j) = I(f_i; \neg s_j \mid C) = 0$. This approximation can
generate errors in the sequential forward selection or backward elimination of
variables. In order to take into account the missing terms, let us consider
the following alternative approximation of $I(f_i; C \mid S^t)$:

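$$\begin{aligned}
I(f_i; C \mid S^t) &= I(f_i; C) + I(f_i; S^t; C) \\
&= I(f_i; C) + I(f_i; \{s_j, \neg s_j\}; C) \\
&= I(f_i; C) + I(f_i; s_j; C) + I(f_i; \neg s_j; C \mid s_j) \\
&= I(f_i; C \mid s_j) + I(f_i; \neg s_j; C \mid s_j).
\end{aligned} \tag{33}$$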

Averaging this decomposition over every single feature $s_j \in S^t$ we have:

$$I(f_i; C \mid S^t) = \frac{1}{|S^t|} \sum_{s_j \in S^t} I(f_i; C \mid s_j) + \frac{1}{|S^t|} \sum_{s_j \in S^t} I(f_i; \neg s_j; C \mid s_j). \tag{34}$$

The Interaction Capping (ICAP) criterion approximates eq. (34) by
the following expression:

$$J_{icap}(f_i) = I(f_i; C) + \sum_{s_j \in S^t} \min\left(0, I(f_i; s_j; C)\right). \tag{35}$$


In ICAP, the information of variable $f_i$ is penalized when the interaction
between $f_i$, $s_j$ and $C$ becomes redundant ($I(f_i; s_j; C) < 0$), but the
complementarity relationship among variables is neglected when
$I(f_i; s_j; C) > 0$. The authors considered a Naive Bayes classifier, which
assumes independence between variables.

Eq. (34) allows deriving the Conditional Mutual Information Maximization
(CMIM) criterion, when we consider only the first term on the right
hand side of this equation and replace the mean operator with a minimum
operator. CMIM discards the second term on the right hand side of eq. (34)
completely, taking into account only one-to-one relationships among variables
and neglecting the multi-information among $f_i$, $\neg s_j$ and $C$ in the
context of $s_j$, $\forall j$. On the other hand, the CMIM-2 criterion
corresponds exactly to the first term on the right hand side of eq. (34).
These methods are able to detect pairs of relevant variables that act
complementarily in predicting the class. In general, CMIM-2 outperformed CMIM
in experiments using artificial and benchmark datasets.
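
The three non-linear scores discussed above can be written compactly once the pairwise quantities are available. The following sketch is illustrative only: it assumes discrete features and a plug-in (empirical frequency) estimator, uses the sign convention of eq. (33) for the interaction term, and the function names (`interaction`, `icap_score`, `cmim_score`, `cmim2_score`) are hypothetical. Falling back to I(f_i; C) when no feature has been selected yet is also an assumption of this example.

```python
from collections import Counter
from math import log2


def entropy(rows, cols):
    """Plug-in (empirical) entropy, in bits, of the columns `cols`."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return -sum(k / n * log2(k / n) for k in counts.values())


def cond_mi(rows, a, b, given=()):
    """Plug-in estimate of I(a; b | given) for single columns a and b."""
    z = tuple(given)
    return (entropy(rows, (a,) + z) + entropy(rows, (b,) + z)
            - entropy(rows, (a, b) + z) - entropy(rows, z))


def interaction(rows, f, s, target):
    """I(f; s; C) = I(f; C | s) - I(f; C); negative means redundancy."""
    return cond_mi(rows, f, target, (s,)) - cond_mi(rows, f, target)


def icap_score(rows, f, target, selected):
    """Eq. (35): relevance plus the capped (non-positive) interactions."""
    capped = sum(min(0.0, interaction(rows, f, s, target)) for s in selected)
    return cond_mi(rows, f, target) + capped


def cmim_score(rows, f, target, selected):
    """Minimum over s_j of I(f; C | s_j)."""
    if not selected:
        return cond_mi(rows, f, target)
    return min(cond_mi(rows, f, target, (s,)) for s in selected)


def cmim2_score(rows, f, target, selected):
    """Mean over s_j of I(f; C | s_j), the first term of eq. (34)."""
    if not selected:
        return cond_mi(rows, f, target)
    return sum(cond_mi(rows, f, target, (s,)) for s in selected) / len(selected)


# Toy data: x0 and x1 are independent fair bits, x2 duplicates x0, and the
# class encodes (x0, x1). Column order: x0, x1, x2, class.
rows = [(x0, x1, x0, 2 * x0 + x1) for x0 in (0, 1) for x1 in (0, 1)]
for f in (1, 2):
    print(f, icap_score(rows, f, 3, [0]), cmim_score(rows, f, 3, [0]),
          cmim2_score(rows, f, 3, [0]))
```

With x0 already selected, all three scores drop to zero for the duplicate x2 and stay at the full relevance for x1.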

So far we have reviewed feature selection approaches that avoid estimating
MI in high-dimensional spaces. Bonev et al. [9] proposed an extension of the
MD criterion, called Max-min-Dependence (MmD), which is defined as follows:

$$J_{MmD}(f_i) = I(\{f_i, S\}; C) - I(\neg\{f_i, S\}; C). \tag{36}$$

The procedure starts with the empty set $S = \emptyset$ and sequentially
generates $S^{t+1}$ as:

$$S^{t+1} = S^t \cup \arg\max_{f_i \in F \setminus S} \left(J_{MmD}(f_i)\right). \tag{37}$$

The MmD criterion is heuristic, and is not derived from a principled approach.
However, Bonev et al. were among the first to select variables by estimating
MI in high-dimensional spaces, which allows using sets of variables instead
of individual variables. Chow and Huang proposed combining a pruned
Parzen window estimator with quadratic mutual information, using Rényi
entropies, to estimate directly the MI between the feature subset $S^t$ and the
classes $C$, $I(S^t; C)$, in an effective and efficient way.
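
The MmD score of eq. (36) and the greedy update of eq. (37) can be sketched as follows. This is a toy under stated assumptions: the plug-in (empirical frequency) estimator on discrete data merely stands in for whichever high-dimensional MI estimator is used in practice, and the names `mi`, `mmd_score` and `mmd_selection` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(rows, cols):
    """Plug-in (empirical) entropy, in bits, of the columns `cols`."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return -sum(k / n * log2(k / n) for k in counts.values())


def mi(rows, a, b):
    """Plug-in estimate of I(A; B) between two tuples of columns."""
    a, b = tuple(a), tuple(b)
    return entropy(rows, a) + entropy(rows, b) - entropy(rows, a + b)


def mmd_score(rows, f, selected, features, target):
    """Eq. (36): MI of the enlarged subset minus MI of its complement."""
    inside = tuple(selected) + (f,)
    outside = tuple(c for c in features if c not in inside)
    return mi(rows, inside, (target,)) - mi(rows, outside, (target,))


def mmd_selection(rows, features, target, k):
    """Eq. (37): start from the empty set and add the best feature k times."""
    selected = []
    for _ in range(k):
        remaining = [f for f in features if f not in selected]
        best = max(remaining,
                   key=lambda f: mmd_score(rows, f, selected, features, target))
        selected.append(best)
    return selected


# Toy data: x0 and x1 are independent fair bits, x2 duplicates x0, and the
# class encodes (x0, x1). Column order: x0, x1, x2, class.
rows = [(x0, x1, x0, 2 * x0 + x1) for x0 in (0, 1) for x1 in (0, 1)]
print(mmd_selection(rows, features=[0, 1, 2], target=3, k=1))
```

In this toy example the first pick is x1: removing x0 or x2 from the complement loses nothing, because the other copy stays behind, whereas removing x1 does.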

6 Open Problems


In this section we present a non-exhaustive list of open problems and
challenges in the field of feature selection, in particular from the point of
view of information theoretic methods.


1. Further developing a unifying framework for information theoretic
feature selection. As we reviewed in section 5, a unifying framework
able to explain the advantages and limitations of successful heuristics has
been proposed. This theoretical framework should be further developed
in order to derive new efficient feature selection algorithms that include
in their functional terms information related to the three types of
features: relevant, redundant and complementary. A stronger connection
between this framework and the Markov blanket is also needed. Developing
hybrid methods that combine maximal dependency with minimal conditional
mutual information is another possibility.

2. Further improving the efficacy and efficiency of information theoretic
feature selection methods in high-dimensional spaces. The computational
time depends on the search strategy and the evaluation criterion. As we
enter the era of Big Data, there is an urgent need for developing very fast
feature selection methods able to work with millions of features and billions
of samples. An important challenge is developing more efficient methods for
estimating MI in high-dimensional spaces. Automatically determining the
optimal size of the feature subset is also of interest, since many feature
selection methods do not have a stopping criterion. Developing new search
strategies that go beyond greedy optimization is another interesting
possibility.

3. Further investigating the relationship between mutual information
and the Bayes classification error. So far, lower and upper bounds for the
classification error have been found for the case of one random variable
and the target class. Extending these results to the case of mutual
information between feature subsets and the target class is an interesting
open problem.

4. Further investigating the effect of a finite sample on the statistical
criteria employed and on MI estimation. Guyon et al. argued that feature
subsets that are not sufficient may render better performance than
sufficient feature subsets. For example, in the bioinformatics domain,
it is common to have very large input dimensionality and small sample size.


5. Further developing a framework for studying the relation between
feature selection and causal discovery. Guyon et al. investigated causal
feature selection. The authors argued that the knowledge of causal
relationships can benefit feature selection and vice versa. A challenge
is to develop efficient Markov blanket induction algorithms for non-faithful
distributions.

6. Developing new criteria of statistical dependence beyond correlation
and MI. Seth and Principe [51] revised the postulates of measuring
dependence according to Rényi, in the context of feature selection. An
important topic is normalization, because a measure of dependence defined
on different kinds of random variables should be comparable. There is no
standard theory about MI normalization. Another problem is that estimators
of measures of dependence should be good enough, even when using a few
realizations, in the sense of following the desired properties of these
measures. Seth and Principe [51] argued that this property is not satisfied
by MI estimators, because they do not reach the maximum value under strict
dependence, and are not invariant to one-to-one transformations.


7 Conclusions 

We have presented a review of the state of the art in information theoretic
feature selection methods. We showed that modern feature selection methods
must go beyond the concepts of relevance and redundance to include
complementarity (synergy). In particular, new feature selection methods that
assess features in context are necessary. Recently, a unifying framework has
been proposed, which is able to retrofit successful heuristic criteria. In this
work, we have further developed this framework, presenting some new results and
derivations. The unifying theoretical framework allows us to indicate the
approximations made by each method, and therefore their limitations. A number
of open problems in the field are suggested as challenges for the avid reader.